{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Word Embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Word embedding is a mapping of a word to a d-dimensional vector space.\n",
    "This real valued vector representation captures semantic and syntactic features.\n",
    "Polyglot offers a simple interface to load several formats of word embeddings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from polyglot.mapping import Embedding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Formats\n",
    "\n",
    "The Embedding class can read word embeddings from different sources:\n",
    "\n",
    "- Gensim word2vec objects: (`from_gensim` method)\n",
    "- Word2vec binary/text models: (`from_word2vec` method)\n",
    "- GloVe models (``from_glove`` method)\n",
    "- polyglot pickle files: (`load` method)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "embeddings = Embedding.load(\"/home/rmyeid/polyglot_data/embeddings2/en/embeddings_pkl.tar.bz2\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "##Nearest Neighbors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A common way to investigate the space capture by the embeddings is to query for the nearest neightbors of any word."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'blue',\n",
       " u'white',\n",
       " u'red',\n",
       " u'yellow',\n",
       " u'black',\n",
       " u'grey',\n",
       " u'purple',\n",
       " u'pink',\n",
       " u'light',\n",
       " u'gray']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "neighbors = embeddings.nearest_neighbors(\"green\")\n",
    "neighbors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "to calculate the distance between a word and the nieghbors, we can call the `distances` method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 1.34894466,  1.37864077,  1.39504588,  1.39524949,  1.43183875,\n",
       "        1.68007386,  1.75897062,  1.88401115,  1.89186132,  1.902614  ], dtype=float32)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embeddings.distances(\"green\", neighbors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The word embeddings are not unit vectors, actually the more frequent the word is the larger the norm of its own vector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZYAAAESCAYAAADe2fNYAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xm8HFWZ//HPAwECybAPCUs0CLgxkAAWIoMmVBQUHWRG\nRtGfKOi4oKgIFgrjAoqi1jAjuAA6DkEdRQcHhBEETUHEQaUQbgQjSgI6LAYFlMimYM7vjzqdU7dz\nl773dld13f6+X69+3dq6+vQDuc+ts5pzDhERkW7ZqO4CiIjI9KLEIiIiXaXEIiIiXaXEIiIiXaXE\nIiIiXaXEIiIiXVVLYjGzjc3sZjO7fJTz55jZ7Wa2wsz2qbp8IiIyeXU9sbwLWAlsMIjGzA4DdnfO\n7QG8GTi34rKJiMgUVJ5YzGwX4DDg3wEb4ZLDgQsBnHM/BrY2sznVlVBERKaijieWfwMSYN0o53cG\n7irt3w3s0utCiYhId1SaWMzsZcBvnXM3M/LTyvpL2/Y174yISEPMqPjzDgQO9+0oM4EtzexLzrnX\nla65B5hX2t/FHxvm8MMPd48//jhz584FYNasWey+++4sXLgQgKGhIYCB2G9t90t56txvj0nd5alz\nf9WqVRx55JF9U5469y+++OKB/v1w1VVXATB37lxmzZrFueeeO9Yf9lPnnKvlBSwCLh/h+GHAFX77\nAOBHI73/6KOPdnWVvd9ewGl1l6FfXoqFYqFYjP2q4ndn1U8s7RyAmb0FwDl3vnPuCjM7zMxWAY8A\nx470xjVr1lRXyv43v+4C9JH5dRegj8yvuwB9ZH7dBRgktSUW59xyYLnfPr/t3PG1FEpERKassSPv\nDz300LqL0E+W1l2APrK07gL0kaV1F6CPLK27AP1iwYIFPf+MxiaWViOVgHPu2rrL0C8Ui0CxCBSL\noIrfnY1NLOVeQIPOzBbXXYZ+oVgEikWgWFSrsYlFRET6U2MTy8KFC4nSbFbd5egHeswPFItAsQgU\ni2o1NrF429RdABERGa6xicW3sWiqF1R/XKZYBIpFoFhUq7GJxduk7gKIiMhwjU0svsucEguqPy5T\nLALFIlAsqtXYxOIpsYiI9JnGJhbfxqLEguqPyxSLQLEIFItqNTaxeEosIiJ9prGJRW0sgeqPA8Ui\nUCwCxaJajU0snhKLiEifaWxiURtLoPrjQLEIFItAsahWYxOLp8QiItJnGptY1MYSqP44UCwCxSJQ\nLKrV2MTiKbGIiPSZxiYWtbEEqj8OFItAsQgUi2o1NrF4SiwiIn2msYlFbSyB6o8DxSJQLALFolqN\nTSyeEouISJ9pbGJRG0ug+uNAsQgUi0CxqFZjE4s3o+4CiIjIcI1NLGpjCVR/HCgWgWIRKBbVamxi\n8ZRYRET6TGMTi9pYAtUfB4pFoFgEikW1GptYPCUWEZE+09jE4ttYDqy7HP1A9ceBYhEoFoFiUa3G\nJhYvrrsAIiIyXGMTi29j+Xzd5egHqj8OFItAsQgUi2o1NrF4GsciItJnGptYfBvLZnWXox+o/jhQ\nLALFIlAsqtXYxOIpsYiI9JnGJhbfxqLEguqPyxSLQLEIFItqNTaxeDPrLoCIiAxXeWIxs5lm9mMz\nGzKzlWZ25gjXLDazh8zsZv96f/s1amMJVH8cKBaBYhEoFtWqvFeVc+5xMzvYOfeomc0AfmBmBznn\nftB26XLn3OHj3E6JRUSkz9RSFeace9RvbgpsDDw4wmU21j3UxhKo/jhQLALFIlAsqlVLYjGzjcxs\nCLgPuMY5t7LtEgccaGYrzOwKM3v2KLda2NOCiojIhNX1xLLOObcQ2AV4wQh/TdwEzHPOLQA+DVza\nfo9Vq1Zxx9c/wcabzvyomZ1mZieU7+PbaQZi3zl3bT+Vp879Vl16v5Snzn1K+qE8de63x6Tu8lS5\n77eX+tdpvranp8w51/MPGbMAZh8AHnPO/csY19wJ7OecW19ltmzZMve+mwwgypP4xt6XVESk+ZYt\nW+aWLFkyZlPDVNXRK2x7M9vab28OvAi4ue2aOWZmfnt/igQ4rB2mlHV37nmh+1z7X2aDTLEIFItA\nsahWHXNt7QhcaGYbUSS2LzvnlpnZWwCcc+cDRwLHmdmTwKPAUWPc7wTgWz0us4iIdKj2qrDJKlWF\nkSdxTx/rRESmi2lZFdYDn6i7ACIiEjQ2sZTaWAZ+LIvqjwPFIlAsAsWiWo1NLCUn1F0AEREJGptY\n/FxhguZBKlMsAsUiUCyq1djEIiIi/amxiaU8ejRKs4Feolj1x4FiESgWgWJRrcYmljZal0VEpE80\nNrG0tbEMdM8w1R8HikWgWASKRbUam1jabF53AUREpNDYxNI2Q+eJdZWjH6j+OFAsAsUiUCyq1djE\n0kZPLCIifaKxiaWtjeWtdZWjH6j+OFAsAsUiUCyq1djEIiIi/amxiaV9FbQozS6oqSi1U/1xoFgE\nikWgWFSrsYnFO7C0fUxdhRARkaCxiWXhwoXkSfzDusvRD1R/HCgWgWIRKBbVamxiERGR/tTYxFJq\nY/mg/7mupqLUTvXHgWIRKBaBYlGtxiaWku/5n9Phu4iINF5jfxmXxrHc0dqI0qyn6zj3K9UfB4pF\noFgEikW1GptYSn5X2j6jtlKIiAjQ4MTSamPJk7jctnJglGab1lOi+qj+OFAsAsUiUCyq1djEMorF\nwPV1F0JEZJA1NrGMseb9flWWox+o/jhQLALFIlAsqtXYxNLmhvLOoC9VLCJSp8Ymlra5wvZvO/2q\nCotSO9UfB4pFoFgEikW1GptYxqHeYSIiNWlsYmlrY9my7fQnKixK7VR/HCgWgWIRKBbVamxiKcuT\n+I9th55XS0FERKS5iaV9PRag3GD/mgqLUjvVHweKRaBYBIpFtRqbWNrlSfwX4Ca/OyNKs72iNNum\nzjKJiAyixnbLHWUcy76l7Z8CfwJmVlKgGqn+OFAsAsUiUCyqNW2eWLyr2/Y3q6UUIiIDrLGJZYQ2\nFoAvtR+I0mzaj8RX/XGgWASKRaBYVKuxiWUUXxvh2I1Rmk237yki0rcq/4VrZjPN7MdmNmRmK83s\nzFGuO8fMbjezFWa2T/v5kdpY/EzH/znC7ab1uBbVHweKRaBYBIpFtSpPLM65x4GDnXMLgb2Bg83s\noPI1ZnYYsLtzbg/gzcC5nd4/T+LXAtu3HX7P1EotIiKdqqWKyDn3qN/cFNgYeLDtksOBC/21Pwa2\nNrM55QtGaWMBIE/iB7pW2AZQ/XGgWASKRaBYVKuWxGJmG5nZEHAfcI1zbmXbJTsDd5X27wZ2meDH\nnD6FIoqIyCTV9cSyzleF7QK8YJS/JtrXr3flnTHWY2n53qQL2DCqPw4Ui0CxCBSLao2bWMxsazP7\noJldYmbfLb3ax4xMmHPuIeDbwHPaTt0DzCvt7+KPrXfxxRdjZkvN7DT/OqGcoG48ecmM+/MrL2rt\nz5r3jCPK581ssfa1r33tT/d9v73Uv04bqxmhW8w5N/YFZt+lSECXAI+XTjnn3Bcn/IFm2wNPOuf+\nYGabA1cBpzvnlpWuOQw43jl3mJkdAHzKOXdA+T5nnXWWO+mkk9qfaoaJ0mw3YFXp0EvzJL5iomXu\nd2a2WH+RFRSLQLEIFItg2bJlbsmSJWP+7pyqTqZ02R/YwTn3py595o7AhWa2EUXC+rJzbpmZvQXA\nOXe+c+4KMzvMzFYBjwDHTvKz2mc9/naUZnsArwTOzpP4kUneV0RERtFJYrkeeCawohsf6Jy7heFz\nerWOn9+2f/xY9+mgjQU2TCwAt7duQZFgGk9/iQWKRaBYBIpFtTpJLMcAV5rZDyl6cbUeoZxz7sO9\nKliXPD7GuX+srBQiIgOkk15hH6Po/jsH2APY3b/26GG5xtVJA1SexGM3IE0T5Ua7QadYBIpFoFhU\nq5MnllcCz3DO3dvrwvTIacCpFIMxh4nSbNs8idsHZ4qIyBR08sRyJ/BErwsyUR22sZAn8el5Em8G\nfGCE03O7WqiaqP44UCwCxSJQLKrVyRPLl4BvmdmnKdpY1nPOZT0pVW/81QjHjgFOrrgcIiLTWidP\nLMdTdBH+GPDFtldtJjHI57wRjiUAUZo5/9p8ygWrgeqPA8UiUCwCxaJanTyx7Oac+0vPS9J7fxjp\nYJRm25R2VwBPr6Y4IiLT05hPLGY2A3jYzPpuid9O21hK1o5yvNx4f9nkSlMv1R8HikWgWASKRbXG\nTCzOuScpBhS2r2/SOHkS/wX4NPB1YLQqr+2qK5GIyPTUSRvLV4DLzewYM1tiZnHr1evCjWUyE6nl\nSfzOPImPypN4tIGTx7Q2ojR7epRm80a5rq+o/jhQLALFIlAsqtVJG8vb/M8PjXBu1y6WpS9EaeYo\nnlx+4Q/1dLI2EZHpZtwnFufcfP/atf1VRQFHM4k2lon4amvDz5Dc11R/HCgWgWIRKBbV6uSJBTPb\nA3gNsBPFuigXOed+2cuCVSAGRhuHc2hpexV6ahER6VgnC339HfAT4BkUPaieCdxoZi/vcdnGNNXF\navIkvoZiSYBx+TEus6f0gT2k+uNAsQgUi0CxqFYnTyxnAi93zl3TOuD/I30G+FaPylWJPIlzwKI0\nM2DdOJf/ET25iIiMq5NeYTsD17Ud+1+K5YJr0802Fj8L8me7dsOKqf44UCwCxSJQLKrVSWJZAbyn\ntWNmBpwI9H7h5ArlSXw88C91l0NEpOk6SSzHAf9kZr8xsxuAe4E3E7oh12KqbSyj+OhYJ6M0m9mL\nD50q1R8HikWgWASKRbU66W78c+BZFOuynOV/Pss5t7LHZatcnsQjzidW8liUZh+spDAiIg3VUXdj\n59wTbNjOUqsejmPJKLoiA+wGrG47f3qUZqf77R3zJF7Tq4J0SvXHgWIRKBaBYlGtcROLn4DyGGAh\nUO5y65xzr+tRuep0CLBNnsT3+33zo/FH8hvUU0xEZJhO2lguBN5FMTvw6rZXbXrUxkKexH8pJZWW\n5452fZRmm0dptmVPCtMh1R8HikWgWASKRbU6qQp7MbCrc+73vS5Mv8qT+IYozd4MfH6E048CRGm2\nf57EeZRme1NUkV1VaSFFRPqEOTdaLY+/wGwFcKhzrva2hLJly5a5JUuWVF4NFaXZfcAOo5x+JnCb\n3/574Ca/f0OexIt7XjgRkXFU8buzk6qwLwGXmtlrylPm1z1tfl3yJJ4DbDLK6dtK25cAv6ZY+2VR\nlGadxFpEpPE6+WX3DmAOxRiPJq953zV5Ej85ibft3H4gSrPXR2l2lZ9SZtJUfxwoFoFiESgW1Rq3\njcU5N7+CcjTR3wC3TuD63YC7AKI0eyPwNOBUf+5c4K1dLZ2ISE06GsfSj3q8Hksnbhv/kmGuidJs\nFkU12r+3nXvpVAqiPvqBYhEoFoFiUS3V+09SnsR/Ke3u2OHbHgFGGt1f64SeIiLd1NjEUmcbS8lu\nwF51j75X/XGgWASKRaBYVKuxVWH9IE/iO0q7WwNvomgvebh0/BSKNW1ERAbCuONY+lVd41g6UZoC\n5ljgYopFwsbzRuAyigb9M/IkfrBHxRORAdYv41iGMbN9zeyrZvZRM9vCzPYws3/uReEabCNgRp7E\nS/MkfnjcqwtfBH4HvBt4YKpdkEVE6jKZqrC/A94CzKP4JXgOsJhx1jLptqGhIZYsWVLlR3bMr0jZ\n3ri/VZ7Ev4jS7D8onmTGsy5Ks3+gGEM0G3gJ8MY8iX/VfqGZLVavl4JiESgWgWJRrckklp9SrMdy\nA7DSzA6naF+QUfjG/TV++w3AG6I0+xywCHj2GG/977b9O9FsyiLS5ybTK+wW4G9bO865y4CPda1E\nHeqDcSxTkifx2/Ik3hPYdSLvG6mKTH+JBYpFoFgEikW1JpRY/Hr3BwHXl4875y6ZwD3mmdk1ZvYz\nM7vVzN45wjWLzewhM7vZv94/kXI2yUhVW+NYN8b6MCIitRszsZjZduV9V1gK7OpnPZ6MJ4B3O+f2\nBA4A3m5mzxrhuuXOuX3864z2k30yjqVbXj7RN0RpNqe1rT76gWIRKBaBYlGt8Z5YjhzpoHPuIuCG\nyXygc26Nc27Ibz8M/BzYaYRLB6YtIU/iyyim2T8eeB6wroO3rVDPMRHpR+M13h9rZg8D1zjn7m07\n95OpfriZzQf2AX7cdsoBB/qnonuA9zjnVpYvaHobS7s8iS8t7W4cpdmlhCeZ4ygGXpbNoagWm+2c\nuzZKs6cB3weuypP4jb0vcX9SXXqgWASKRbXGe2J5CPhH4FYz+6WZnWdmrzKzORRVWpNmZrMpBg++\nyz+5lN0EzHPOLQA+DVza/v4BUJ7t+LIxrns4SrNrKJaK3pmix9lNPS2ZiMgYxkssH3DOHQFsD7wa\nWAW8DvgFcNZkP9TMNgG+CXzFObdB0nDO/dE596jfvhLYxMy2LV9z9tlnY2ZLzew0/zqhXI/qOwA0\ndv/Gk5c88/78O98AzsmT+N47LjrzTWtXh3altauHaO2vXT20uLwP7LPb0R9abmaLozRbGqXZN+v+\nPlXtt471S3lq3j+hz8pT5/60+v0wkX2/vdS/TquifXpSU7qY2QzgE865kybxXgMuBB5wzr17lGvm\nAL91zjkz2x/4Rvu6MGeddZY76aSTBq6NIUqzRcC15WNrVw+x5W7jVg3Oy5P47l6Vq1+YaSBci2IR\nKBZBFVO6THquMDNb2GqEn+D7DqJoC/gpRVsKFPNjPQXAOXe+mb2dol3hSeBR4ETn3I/K9+nnucJ6\nbYrdjWfmSfynrhVGRBqlrxNL3QY5scCUk0sCpBTT/t/pp6ARkQHQl5NQ9otpNo5lMvakeNLbt9S2\nsrjD96b+52r8gEv/anwniXI986BTLALFolpaj6Wh8iReCawEMFty8HM+ueyGPIkfjdJsKrd9eZRm\nz8uT+IddKaSIDKTGJpbpNo5lKtoaJbegaJc6AfjUJG53fZRmbwHOp5ix+gx/nzl5Ev/9FIvac2qg\nDRSLQLGoVmOrwmRkeRI/liexUYz/mazz/c9/Bh6jWCbhiNY0MlGaWZRmc6dWUhGZrhqbWNTGEoxU\nf5wn8TqKJ9KNS4en+pi3Jkqz7wF3AL+J0uz2Kd6v61SXHigWgWJRrcZWhcn48iRuLTZW7gFiUZod\nBXwvT+L7AaI0u4JiIbFOlFdX2z1Ks7l5Eq8p9VLbKk/itVMquIg0mroby3pRmg0BCyb4tkXAfcBt\nfn91nsS7d7VgItI16m4sVfvnSbznQKBcL7lbl8oiIg3V2Kqwfl7zvmrdmq4iT+JvR2l2HPA4cEHp\nuI0xIPPM9gNRmi0AbgWOoJhotOWmPIn3K103n2J6n2PzJL5jquUHTd1RplgEikW19MQiw+RJfF6e\nxEtHOD6RR+chiul4Lm47vm+UZn9b2r8TeAHFQM1RRWl2vR/AqZ5oIg3Q2MSicSxBj/4Sm03xi/95\nXb7vD8a7IEqzRVGa7eq3DyqV4dbx3qu/SgPFIlAsqqXGe+lYlGb7ExZl+wLwpi7efgvgExRT1cT+\n2CLgw/5nyz8CT+RJ/K0ufrbIwFDj/Rg0jiWosI9+Oejl5aT/CZhqT7BHgXcQkgrAcoYnFYD/Ai6N\n0uyIkW6i8QqBYhEoFtVqbGKR6uVJ/OfS7unAdhTrvHwxT+LVFP8/PXeEt54PPL/LxbnEt7s8GKXZ\ntuNfLiJVaWyvMLWxBBXXH8+gGAT5oN9v/cRPv38DwwdkrjfFCTJHsw2QAQujNDviOZ9cNrP0eUcA\npwFRnsRTWkq7idSuECgW1WpsYpF6+NH8D4574cg+SNFm0nIzsM+UCwULyt2hozSLgBNL5/8cpdlG\nrXVnojTbC9gpT+KruvDZItKmsVVhamMJmlJ/nCfxR3y35Y2ApwHPoeh5BnDFBLs0j8ivTXPiCKfW\nrztDsXrpd6I0e2mUZru0Xxil2V9FaTZrqmWpW1P+v6iCYlEtPbFI5fyTw50AUZrtBuyYJ/G9/pxF\nafYU4Ndj3GIm8ARwFPCfUyjK//gyvB34Zp7E90VpNhtY649vRPFvxOVJ/OQUPkdkoKi7sfSlKM2M\noq3G5UnsojS7BfgbGD5Yc4pLNI/nGGCp354J7Jon8W3+c18EXA0clifxlT0sg0hXac37MSixCIyb\nWL4O/B54a4+LsSXwMPAx4Lt5Evekl4JIN2gcyxjUxhIMeP3xLODlrZ3VX/nIG4B3ApvnSXxUnsTH\nUcxZ1kv3UgzufB+wrMef1bEB//9iGMWiWmpjkUbLk/hR4DJ8F2c7ecniPLnm2rZrvtU6P8ITzg4U\nAzw/NoVizAaS1k7bZ6wE9qNoz1kCHJMn8YWlazcB1pXWzhFpvMY+sWgcS6A++kEHsfh7imWbDwVe\nmifx7/IkPhP4TY+K9GyK5Z1bU3EvjdLsBIAozd4N/Bl40vdY2+DfY5RmG0Vp9rMozc4e4dyHojQ7\nw29v2t6TTf9fBIpFtdTGIgJEabYzcLff3QL4JbALsBnwr8DbR3jbbcAzu1yUz1I8PR0PfA64AtjL\nn9s6T+KHojTbmmKutiP98UMoOhIAbJEn8WPtN43S7DzgqRSdDZr5j166oorfnY2tCtN6LIHWmggm\nG4s8ie9h+IwB80rbx/tXezXXQoq1a7rp7YQkdkrbuTXA5hQdEsquLm2/H/jnKM02/umZrzlk71O+\n+lGKqro9/PnnA9/vbpH7n/6NVKuxVWEiNWmN1v9CnsR/8l2ftymdv4zil/tIPj/Fz57ZQffqU6M0\nWww8Of+VJ19BMbPBHqXzy0d7Y5RmC6M0O3CKZRRRVZhIL0RpZn78jcH6QaGjdY++l+GzRffaO/Ik\n/owvzy4UE4peRHjyOTBP4h+OdYMozZ5GsUDboXkSXz3WtdJfVBUm0lCtRDJGe8a/Ae/22/MpGvGr\n8ukozT7ry3aXP/aG0vlLojSbA/yJourtAOAc4LV5Ev8iSrNNCat+XkXbpKNRmh1FseT0M/Ik/lXv\nvob0q8ZWhWkcS6A++kEDYjGDYozNicBWwAw/8/IbKNakebhbH+TnTRvNujGq1eb4n5sB64DrKeZ1\nu82/50/li6M0mx2l2T9EabaVf0L7GrApYR442q63yczFFqXZsVGaTaqzRAP+v5hW9MQiUiE/XuUv\nfntt6fgFwAVRmm1O0RX6uxT/PjcBHvDVatswfGbp7YAH/PbbKHqR1WE1xXgggI+UT5SS18spvtPj\nFInnVVGa7Vp+ovFPQouB69p7tkVp9hNgX7+7QTWOf+/GI/WIk+qpjUWkQaI0+xnF2Jj2OdNmUoyX\nafkf4KUUMxPsRfEUsjNwbmWF7cz+wAKK6Xcuo0gsVwIfoljb5wN5Ep/R9nT1sjyJv93aidJsY6A1\nSeiRFNPqrGWC/DiilwI350l893jXN5XmChuDEovIcKVfvluN9IvVV1OdTbEEdJPMZsMqwp8Ce1N0\nOFhBaeYDb6NOx+v4armEYhDrQf7wZm0rpk5ZlGYzgLnAPXWOJdJcYWNQG0ug+uNgkGORJ7H511rY\nMBZ5Ers8id8JvAy4pHRq/w4/4mn+dX8XijsRI7U77e1/HsKGSQXgV1Ga3e7bfo7ZeOassQa9fYTi\nCemg0rHrJ1fUMf2BorPEuijNXtmD+/cNtbGIDBhfjfTtKM32BHbOkzgfZdno8pPC4jyJW2vozKNI\nRj9k9N5sEZB3teAT8xT/85sAux/zkdbS2J+h6FRwlj//bYrqr3b7wfqnmS0oJhj9F4pBqnPyJF4z\nkcJEabYdRbVky9eBb0zkHk2iqjARIUqzF1O0v+xC8Qfn4jyJRx1MWXrfIRSLre1D6Lr81jyJzy9d\n08xfMmP7L4qnvrcD2wO7AltT9OwrWwScQdHR4tnlE602Mr9s9rpeF7hFbSxjUGIR6S9Rmh0B7J0n\n8Yfbjm9B0ZPrSeD/UczDdl3pktP8q+UU4IvAJykWW5uuhiimBQL4ZZ7Ez6jiQ6dlYjGzecCXKLon\nOuDzzrlzRrjuHOAlFH8BHOOcu7l8/qyzznInnXSSEguaB6lMsQj6ORZRmj0XeAVFr68/RWl2KvBR\nIMqT+EZ/zQ7AfaW3vRR4AfBev386RdtIy3bAiyhmERhm7eohttyt72dE3xN4l3/tmCfxnT4GDwEX\nAz+g6Nn3DiDNk/jkkW7iO2lsnifxo76n2+Z5Ej/SOj9dE8tcYK5zbsjMZgM/AY5wzv28dM1hwPHO\nucPM7LnA2c65A8r3UWIJ+vkXSNUUi2A6xCJKs38lzFCwFUWX6tcBV+dJfFeUZm+jmBF67zyJb/Hv\nuZ8iyazXkMQyGa8DrqWYmfttFE+CFxDG/JR9LU/i10zLxLJBAcwuBT7tnFtWOnYecI1z7ut+/zZg\nkXNu/V8vqgoTGQx+YOjMPIk7WjPHV70dStFI/2CexP/nj28P/M5fNkSx7MFRbW//NMO7Y58LHDf5\n0m/gQuD1XbzfhORJbNM+sZjZfIrZVvd0zj1cOn45cKZz7nq//z3gvc65n7SuUWIRkckoN5b7wZVb\nU/SAe5TQlXp/4AY/48Ec4LfAM4CfU1TPzdngxp3ZjLYpcSo28+P7usen7SSUvhrsYuBd5aRSvqRt\nf1gGPPvss3nhC1+4FPiVP/QHYKj16N/qwz8I++XxCv1Qnjr322NSd3lq3l/onPtUH5Wnzv0T8L8f\n8iRe13b+ATPbq+36zYFFJO7aPInvK11vUZrZ2tVD64D11Wutedn8/tybP/TyvXZ73enfLZ9/4CdX\nv/X+/Dt/jtLst2tXD+0wxvu7ur929RD331is9rDZNnMfH9po756vZVXLE4uZbUIx5cSVrf/x286f\nB1zrnLvI729QFaY2lmA61KV3i2IRKBZBt2Phlw14KsWgygUUHRGAYd2IjwA+AbwiT+JbS++NKKar\nuTZP4oOjNPs1YdwNwFrgU8AHKX5Pvqzt4xdTtKtMysf3dUy7qjAzM4p6xgecc+8e5Zpy4/0BwKfa\nG+9VFSYi/aI0Vudf8yQ+aYLvLc91tsFUMlGaHU3RkxYoEtcoY4MOBq4Z7/OqSCx1VIX9LfBa4Kdm\n1upCfCo+YzvnznfOXWFmh5nZKuAR4Ngayiki0qkFFMs+nz/ehe38jNej/qLPk/jLwJejNJuZJ3Fr\nKeytKOabj91iAAAK5ElEQVR9WwrcmifxAwBRmr2CYraEq0q3mAW8mWINoPOBt0y0jBNVe6+wyVJV\nWKAqj0CxCBSLYNBj4ce2PANY9fF93RPT8YlFREQq5GdTvg1g2bJl41w9dY2d3Xjhwmk52GlSBvkv\nsXaKRaBYBIpFtRqbWEREpD81NrFoPZZgkNcgaadYBIpFoFhUq7GJRURE+lNjE4vaWALVHweKRaBY\nBIpFtRqbWEREpD81NrGojSVQ/XGgWASKRaBYVKuxiUVERPpTYxOL2lgC1R8HikWgWASKRbUam1hE\nRKQ/NTaxqI0lUP1xoFgEikWgWFSrsYlFRET6U2MTi9pYAtUfB4pFoFgEikW1GptYRESkPzU2saiN\nJVD9caBYBIpFoFhUq7GJRURE+lNjE4vaWALVHweKRaBYBIpFtRqbWEREpD81NrGojSVQ/XGgWASK\nRaBYVKuxiUVERPpTYxOL2lgC1R8HikWgWASKRbUam1hERKQ/NTaxqI0lUP1xoFgEikWgWFSrsYlF\nRET6U2MTi9pYAtUfB4pFoFgEikW1GptYRESkPzU2saiNJVD9caBYBIpFoFhUq7GJRURE+lNjE4va\nWALVHweKRaBYBIpFtRqbWEREpD81NrGojSVQ/XGgWASKRaBYVKuxiUVERPpTYxOL2lgC1R8HikWg\nWASKRbUam1hERKQ/VZ5YzOw/zOw+M7tllPOLzewhM7vZv94/0nVqYwlUfxwoFoFiESgW1arjieUC\n4MXjXLPcObePf50x0gWrVq3qfsmaS/WCgWIRKBaBYuFV8Ud55YnFOXcd8PtxLrPx7vPII490p0DT\nw9Z1F6CPKBaBYhEoFt6KFSt6/hn92MbigAPNbIWZXWFmz667QCIi0rkZdRdgBDcB85xzj5rZS4BL\ngae3X7RmzZrKC9bH5tddgD4yv+4C9JH5dRegj8yvuwCDxJxz1X+o2XzgcufcXh1ceyewn3PuwfLx\n4447zpWrwxYsWDCwXZCHhoYG9ru3UywCxSIY5FgMDQ0Nq/6aNWsW55577rjNDVPRd4nFzOYAv3XO\nOTPbH/iGc25+tSUUEZHJqrwqzMy+BiwCtjezu4APAZsAOOfOB44EjjOzJ4FHgaOqLqOIiExeLU8s\nIiIyffVjr7BxmdmLzew2M7vdzN5bd3m6wczmmdk1ZvYzM7vVzN7pj29rZt81s1+a2dVmtnXpPaf4\nGNxmZoeUju9nZrf4c2eXjm9mZl/3x39kZk+t9ltOjJlt7AfJXu73BzIWZra1mV1sZj83s5Vm9twB\njsUp/t/ILWb2VV/2gYjFSIPLq/ruZvZ6/xm/NLPXjVtY51yjXsDGwCqKXh6bAEPAs+ouVxe+11xg\nod+eDfwCeBbwSeBkf/y9wMf99rP9d9/Ex2IV4Qn0BmB/v30F8GK//Tbgc377VcBFdX/vcWJyIvCf\nwGV+fyBjAVwIvMFvzwC2GsRY+O9zB7CZ3/868PpBiQXwfGAf4JbSsZ5/d2BbYDXFWKCtW9tjlrXu\nYE0iuM8DvlPafx/wvrrL1YPveSnwQuA2YI4/Nhe4zW+fAry3dP13gAOAHYGfl44fBZxXuua5fnsG\n8Lu6v+cY338X4HvAwRQdPRjEWFAkkTtGOD6IsdiW4g+ubXw5LwdeNEixoEgS5cTS8+8OvBo4t/Se\n84CjxipnE6vCdgbuKu3f7Y9NG77X3D7Ajyn+p7nPn7oPmOO3d6L47i2tOLQfv4cQn/Wxc849CTxk\nZtt2/xt0xb8BCbCudGwQY7Er8Dszu8DMbjKzL5jZLAYwFq4YcnAW8H/AvcAfnHPfZQBjUdLr777d\nGPcaVRMTy7TubWBms4FvAu9yzv2xfM4Vfy5M6+8PYGYvo+hyfjOjTO8zKLGg+MtxX4oqin2BRyie\n0tcblFiY2W7ACRR/te8EzDaz15avGZRYjKSfvnsTE8s9wLzS/jyGZ9PGMrNNKJLKl51zl/rD95nZ\nXH9+R+C3/nh7HHahiMM9frv9eOs9T/H3mgFs5doGnvaJA4HDrRgc+zUgNrMvM5ixuBu42zmX+/2L\nKRLNmgGMxXOA651zD/i/qP+bomp8EGPR0ut/Ew+McK9xf+c2MbHcCOxhZvPNbFOKRqbLai7TlJmZ\nAV8EVjrnPlU6dRlFAyX+56Wl40eZ2aZmtiuwB3CDc24NsNb3HDLgaOBbI9zrSGBZz77QFDjnTnXO\nzXPO7UpRB5w5545mMGOxBrjLzFrTGr0Q+BlF+8JAxYKiPeEAM9vcf4cXAisZzFi0VPFv4mrgECt6\nJ25D0a511ZilqrsxapINWC+haMRbBZxSd3m69J0OomhPGAJu9q8XUzRYfg/4pf8PvHXpPaf6GNwG\nHFo6vh9wiz93Tun4ZsA3gNuBHwHz6/7eHcRlEaFX2EDGAlgA5MAKir/StxrgWJxMkVhvoegtt8mg\nxILi6f1e4M8UbSHHVvXd/Wfd7l+vH6+sGiApIiJd1cSqMBER6WNKLCIi0lVKLCIi0lVKLCIi0lVK\nLCIi0lVKLCIi0lVKLCJTZGan+ZkBqv7c+Wa2zsz071j6iv6HlGnHr0NxRdux20c59soufOSog8HM\nbLH/5f9HM1vr17N4cxc+U6RvKbHIdLQcONBPWdGaQ2kGsLD1170/thvw/Ync2Mw2nkR57nHO/ZVz\nbkvgXcDnzGzPSdxHpBGUWGQ6upFiqo+Ffv/5wDUU016Uj612zq0xs53M7DIze8A/xfxT60a+muti\nM/uymT0EvN7MdjWz5f4J5Gpg+04L5py7EniAYhE3zGwbM/sfM/utmT1oZpeb2fopyc3sWjP7sJn9\nwH/eVX4q8w2Y2SvM7E4ze3an5RHpBSUWmXacc3+mWMtmkT/0AuA64Ad+u3Vsud++iGKNjx0pJt/7\nmJkdXLrl4cB/Oee2Ar7qXzmwHfARion7xp0bycw2MrPDKeb6url1mGLy0af412PAZ9re+mrgGGAH\nYFPgPRve2o4FPg4scc6tHK8sIr2kxCLT1XJCEjmIosrrutKx5wPLzWwexTT973XO/dk5twL4d6C8\nrvf1zrnWDNo7UEzf/gHn3BPOuesoZtcdcd0Ybycz+z3wKHAJcLRzbjUUi1c55y5xzj3unHsY+Bgh\nIUKRsC5wzq1yzj1OMUngwrb7v5si2Sxyzt0xfmhEekuJRaar7wMH+Wm+/9r/Iv8hRdvLNsCe/pqd\ngAedc4+U3vt/DF8hr7z2xE7A751zj5WO/XqcstzrnNsG2BI4Gzi11NazhZmdb2a/8lVty4GtWu1D\n3prS9mPA7Lb7nwR81jl37zjlEKmEEotMVz+iqHJ6E/C/AM65tRTTjr+Z4pf9r/3+tn7lzpanMDyZ\nlKu5fgNsY2ZblI49lQ6qwnwV3Xt9uY72h08Cng7s76vaFlE8/Yz1BNTuEOD9ZvYPE3iPSM8osci0\n5J8obgROZHjPrx/4Y8v9dXcB1wNnmtlmZrY38AbgK6Pc99f+vqeb2SZmdhDwsgmU6wmKddtP9odm\nUzyFtNZW/9AIbxsvyfyMYu2ez5rZ33VaFpFeUWKR6Ww58NcUyaTlOopeXOVk82qKddTvpVhI64PO\nucyfG2kd8dcAzwUeBD5IseDUWNrf/x/ADr4h/1PA5sD9FAnuyhGud23b7fs4535KkeC+YGaHjlMe\nkZ7SQl8iItJVemIREZGuUmIREZGuUmIREZGuUmIREZGuUmIREZGuUmIREZGuUmIREZGuUmIREZGu\nUmIREZGu+v8K14k92rmkbAAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<matplotlib.figure.Figure at 0x7f301c9c9e90>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "norms = np.linalg.norm(embeddings.vectors, axis=1)\n",
    "window = 300\n",
    "smooth_line = np.convolve(norms, np.ones(window)/float(window), mode='valid')\n",
    "plt.plot(smooth_line)\n",
    "plt.xlabel(\"Word Rank\"); _ = plt.ylabel(\"$L_2$ norm\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This could be problematic for some applications and training algorithms.\n",
    "We can normalize them by $L_2$ norms to get unit vectors to reduce effects of word frequency, as the following"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "embeddings = embeddings.normalize_words()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "white   0.4261\n",
      "blue    0.4451\n",
      "black   0.4591\n",
      "red     0.4786\n",
      "yellow  0.4947\n",
      "grey    0.6072\n",
      "purple  0.6392\n",
      "light   0.6483\n",
      "pink    0.6574\n",
      "colour  0.6824\n"
     ]
    }
   ],
   "source": [
    "neighbors = embeddings.nearest_neighbors(\"green\")\n",
    "for w,d in zip(neighbors, embeddings.distances(\"green\", neighbors)):\n",
    "  print(\"{:<8}{:.4f}\".format(w,d))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Vocabulary Expansion"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from polyglot.mapping import CaseExpander, DigitExpander"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not all the words are available in the dictionary defined by the word embeddings.\n",
    "Sometimes it would be useful to map new words to similar ones that we have embeddings for.\n",
    "\n",
    "### Case Expansion\n",
    "For example, the word `GREEN` is not available in the embeddings,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"GREEN\" in embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "we would like to return the vector that represents the word `Green`, to do that we apply a case expansion:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "embeddings.apply_expansion(CaseExpander)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"GREEN\" in embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'White',\n",
       " u'Black',\n",
       " u'Brown',\n",
       " u'Blue',\n",
       " u'Diamond',\n",
       " u'Wood',\n",
       " u'Young',\n",
       " u'Hudson',\n",
       " u'Cook',\n",
       " u'Gold']"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embeddings.nearest_neighbors(\"GREEN\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Digit Expansion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We reduce the size of the vocabulary while training the embeddings by grouping special classes of words.\n",
    "Once common case of such grouping is digits.\n",
    "Every digit in the training corpus get replaced by the symbol `#`.\n",
    "For example, a number like `123.54` becomes `###.##`.\n",
    "Therefore, querying the embedding for a new number like `434` will result in a failure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"434\" in embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To fix that, we apply another type of vocabulary expansion `DigitExpander`.\n",
    "It will map any number to a sequence of `#`s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "embeddings.apply_expansion(DigitExpander)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\"434\" in embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, the neighbors of the new number `434` will be other numbers:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[u'##',\n",
       " u'#',\n",
       " u'3',\n",
       " u'#####',\n",
       " u'#,###',\n",
       " u'##,###',\n",
       " u'##EN##',\n",
       " u'####',\n",
       " u'###EN###',\n",
       " u'n']"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "embeddings.nearest_neighbors(\"434\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Demo"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Demo is available [here](https://bit.ly/embeddings)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "### Citation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This work is a direct implementation of the research being described in the [Polyglot: Distributed Word Representations for Multilingual NLP](http://www.aclweb.org/anthology/W13-3520) paper.\n",
    "The author of this library strongly encourage you to cite the following paper if you are using this software."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```\n",
    "@InProceedings{polyglot:2013:ACL-CoNLL,\n",
    " author    = {Al-Rfou, Rami  and  Perozzi, Bryan  and  Skiena, Steven},\n",
    " title     = {Polyglot: Distributed Word Representations for Multilingual NLP},\n",
    " booktitle = {Proceedings of the Seventeenth Conference on Computational Natural Language Learning},\n",
    " month     = {August},\n",
    " year      = {2013},\n",
    " address   = {Sofia, Bulgaria},\n",
    " publisher = {Association for Computational Linguistics},\n",
    " pages     = {183--192}, \n",
    " url       = {http://www.aclweb.org/anthology/W13-3520}\n",
    "}\n",
    "```"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
